Supplemental Material: Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text
Abstract
We use $C$ to denote a generic count of co-occurrences; for example, $C_{tdk}$ is the number of words in document $d$ at epoch $t$ that are generated from topic $k$. We may replace an index with a dot to denote summation over that dimension; for example, $C_{td\cdot}$ is the total number of words in document $d$ at epoch $t$, and $C_{t\cdot k}$ is the total number of words generated from topic $k$ at epoch $t$. Finally, we use a negative sign in the superscript to denote exclusion; for example, $C^{-tdi}_{tdk}$ is the same quantity as $C_{tdk}$ without the contribution of word $i$, although we sometimes abuse notation and use the superscript $-i$ alone when the meaning is clear from the context.
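The count notation above can be illustrated with a small array sketch. This is only an illustration under stated assumptions, not the paper's implementation: the toy `assignments` list, the sizes `T`, `D`, `K`, and the helper names `build_counts` and `count_excluding` are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: one (epoch t, document d, topic k) triple per word token.
assignments = [
    (0, 0, 1),
    (0, 0, 1),
    (0, 0, 2),
    (0, 1, 0),
    (1, 0, 2),
]
T, D, K = 2, 2, 3  # assumed numbers of epochs, documents per epoch, and topics


def build_counts(assignments, shape):
    """Return C with C[t, d, k] = number of words in document d at epoch t from topic k."""
    C = np.zeros(shape, dtype=int)
    for t, d, k in assignments:
        C[t, d, k] += 1
    return C


def count_excluding(C, assignments, i):
    """Return a copy of C without the contribution of word token i (the exclusion superscript)."""
    t, d, k = assignments[i]
    C_minus = C.copy()
    C_minus[t, d, k] -= 1
    return C_minus


C = build_counts(assignments, (T, D, K))

# Dropping an index corresponds to summing over that axis.
C_td = C.sum(axis=2)  # C_{td.}: total words in document d at epoch t
C_tk = C.sum(axis=1)  # C_{t.k}: total words from topic k at epoch t

# Counts with the first word token removed.
C_minus = count_excluding(C, assignments, 0)
```

Working on a copy in `count_excluding` keeps the full counts intact; a sampler that removes and re-adds one word at a time would more likely decrement and increment the array in place.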
Similar resources
Online Inference for the Infinite Topic-Cluster Model: Storylines from Streaming Text
We present the time-dependent topic-cluster model, a hierarchical approach for combining Latent Dirichlet Allocation and clustering via the Recurrent Chinese Restaurant Process. It inherits the advantages of both of its constituents, namely interpretability and concise representation. We show how it can be applied to streaming collections of objects such as real world feeds in a news portal. We...
Online Latent Dirichlet Allocation with Infinite Vocabulary
Topic models based on latent Dirichlet allocation (LDA) assume a predefined vocabulary. This is reasonable in batch settings but not reasonable for streaming and online settings. To address this lacuna, we extend LDA by drawing topics from a Dirichlet process whose base distribution is a distribution over all strings rather than from a finite Dirichlet. We develop inference using online variati...
Online Variational Inference for the Hierarchical Dirichlet Process
The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric model that can be used to model mixed-membership data with a potentially infinite number of components. It has been applied widely in probabilistic topic modeling, where the data are documents and the components are distributions of terms that reflect recurring patterns (or “topics”) in the collection. Given a document collect...
Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
Nonparametric Bayesian Storyline Detection from Microtexts
News events and social media are composed of evolving storylines, which capture public attention for a limited period of time. Identifying storylines requires integrating temporal and linguistic information, and prior work takes a largely heuristic approach. We present a novel online non-parametric Bayesian framework for storyline detection, using the distance-dependent Chinese Restaurant Proce...